IMDb (an acronym for Internet Movie Database) is an online database of information mainly related to films and television series.

The movie industry is increasingly growing each year alongside the spread of streaming services and the huge amounts of money invested in it. Shaping culture and new trends is enough evidence that it is quite influential, which brings interest to analyze it using the IMDB dataset and try to answer some questions that some might be interested in.

1 Used Libraries

library(dplyr) # Data Manipulation
library(readr) # Reading data
library(tidyr) # Data Cleaning
library(ggplot2) # Data Visualization
library(plotly) # Interactive plots
library(lubridate) # Dates Manipulation
library(anytime) # Dates Manipulation
library(scales) # Adding dollar signs
library(wordcloud2) # Wordcloud plot
library(tm) # Text Mining
library(caret) # Modelling

2 Objectives

Clean and Extract useful data from the dataset.
Analyze the movies, genres and directors info.
Explore the financial side of the industry.
Build a model to predict the rating of a movie.

3 Data Cleaning

# Reading the data
data <- read_csv("imdb_data.csv")

Rows: 3348 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): id, primaryTitle, originalTitle, genres, release_date, directors
dbl (6): isAdult, runtimeMinutes, averageRating, numVotes, budget, gross

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Checking the data
glimpse(data)

Rows: 3,348
Columns: 12
$ id             <chr> "tt0035423", "tt0065421", "tt0065938", "tt0066026", "tt…
$ primaryTitle   <chr> "Kate & Leopold", "The Aristocats", "Kelly's Heroes", "…
$ originalTitle  <chr> "Kate & Leopold", "The AristoCats", "Kelly's Heroes", "…
$ isAdult        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ runtimeMinutes <dbl> 118, 78, 144, 116, 172, 86, 136, 120, 102, 104, 91, 118…
$ genres         <chr> "Comedy,Fantasy,Romance", "Adventure,Animation,Comedy",…
$ averageRating  <dbl> 6.4, 7.1, 7.6, 7.4, 7.9, 6.6, 8.3, 6.5, 7.7, 7.7, 7.9, …
$ numVotes       <dbl> 87925, 111758, 52628, 75784, 106476, 53906, 864440, 112…
$ budget         <dbl> 48000000, 4000000, 4000000, 3500000, 12000000, 777000, …
$ gross          <dbl> 76019048, 35459543, 5200000, 81600000, 61749765, 243700…
$ release_date   <chr> "December 11, 2001", "December 11, 1970", "January 1, 1…
$ directors      <chr> "James Mangold", "Wolfgang Reitherman", "Brian G. Hutto…

# There's two columns for the title, checking whether it's duplicated or not
summary(data$primaryTitle == data$originalTitle)

   Mode   FALSE    TRUE 
logical     145    3203

# Checking for duplicated observations
summary(duplicated(data))

   Mode   FALSE 
logical    3348

# We'll have to keep the two columns as there's 145 titles that are different

# The column isAdult is filled with zeros therefore it won't be used
summary(data$isAdult == 0)

   Mode    TRUE 
logical    3348

clean_data <- data %>% select(-isAdult)

# Checking crucial columns for NAs
summary(is.na(data$runtimeMinutes))

   Mode   FALSE 
logical    3348

summary(is.na(data$directors))

   Mode   FALSE 
logical    3348

summary(is.na(data$genres))

   Mode   FALSE 
logical    3348

summary(is.na(clean_data$release_date))

   Mode   FALSE    TRUE 
logical    3343       5

# 5 movies are without a release date which could be compensated easily
movie_names <- unlist(as.vector(clean_data %>% 
                                  select(primaryTitle) %>%
                                  filter(is.na(clean_data$release_date))))
movie_names

     primaryTitle1      primaryTitle2      primaryTitle3      primaryTitle4 
   "Groundhog Day"        "Quiz Show"       "Shark Tale"       "Waiting..." 
     primaryTitle5 
"Midnight Special"

# Extracting the dates using Google
# (ordered the same order as the print result of movie_names)
dates <- c("February 12, 1993", "September 14, 1994", "October 1, 2004",
           "October 7, 2005", "April 21, 2016")
full_data <- clean_data
for(i in 1:length(movie_names)){
  full_data$release_date[full_data$primaryTitle == movie_names[i]] <- dates[i]
}
# Verifying the results
full_data %>% 
  filter(primaryTitle %in% movie_names) %>% 
  select(primaryTitle, release_date)

summary(is.na(full_data$release_date))

   Mode   FALSE 
logical    3348

# Sampling the data revealed that the dates don't share the same format
# (I discovered this while trying to compensate the 5 missing dates)
set.seed(3)
sample(full_data$release_date, 50, replace = T)

 [1] "February 11, 2015"  "1996"               "December 18, 2000" 
 [4] "August 8, 2012"     "July 10, 2015"      "February 10, 2016" 
 [7] "May 11, 2000"       "December 27, 1991"  "March 20, 1998"    
[10] "1989"               "November 10, 2011"  "May 3, 2002"       
[13] "August 2021"        "May 24, 2002"       "July 11, 2005"     
[16] "December 14, 2009"  "December 25, 2013"  "December 7, 2001"  
[19] "July 1, 1987"       "October 30, 2022"   "May 21, 2019"      
[22] "September 8, 2017"  "October 20, 2010"   "June 21, 2015"     
[25] "June 22, 1990"      "November 13, 1987"  "August 7, 2019"    
[28] "June 14, 2002"      "May 30, 1995"       "October 2007"      
[31] "May 13, 2000"       "November 2, 2000"   "October 2000"      
[34] "June 7, 2005"       "May 19, 2014"       "July 30, 2010"     
[37] "June 13, 2022"      "November 8, 1995"   "June 14, 2018"     
[40] "January 25, 2003"   "September 4, 2017"  "September 2, 2013" 
[43] "September 19, 1980" "June 24, 2019"      "May 21, 2015"      
[46] "July 21, 2007"      "December 1, 1982"   "January 12, 2022"  
[49] "August 22, 2012"    "January 30, 2014"

# Converting the char format dates to POSIXct
summary(is.na(ymd(full_data$release_date)))

Warning: All formats failed to parse. No formats found.

   Mode    TRUE 
logical    3348

summary(is.na(anydate(full_data$release_date)))

   Mode   FALSE    TRUE 
logical    3298      50

summary(is.na(as.Date(full_data$release_date, format = c("%Y", "%B %Y", "%B %d, %Y"))))

   Mode   FALSE    TRUE 
logical    1099    2249

All 3 functions won’t work, so I had to search for a better solution which I found thanks to this page, parse_date_time tries to parse the dates using the patterns in orders; the orders vector is composed of the possible formats that could be found in the release_date column but the downside is that movies having only a year as their date will be parsed into the date of the first day of that year. (same thing with month, year format)

new_data <- full_data %>%
  mutate(release_date = parse_date_time(release_date,
                                        orders = c('mdy', 'dmy','my','y')))
# Checking the results of the parse_date_date
sample(new_data$release_date, 50, replace = T)

 [1] "2018-09-08 UTC" "1984-12-14 UTC" "2010-09-01 UTC" "1986-01-17 UTC"
 [5] "2015-07-10 UTC" "2001-12-18 UTC" "2003-06-30 UTC" "2012-09-07 UTC"
 [9] "2018-11-11 UTC" "2010-02-25 UTC" "1996-01-01 UTC" "1997-12-19 UTC"
[13] "2010-09-08 UTC" "2007-10-25 UTC" "2001-10-12 UTC" "2018-05-10 UTC"
[17] "2011-05-15 UTC" "2012-08-08 UTC" "2022-09-01 UTC" "1978-10-25 UTC"
[21] "2011-01-22 UTC" "2009-03-06 UTC" "1985-08-19 UTC" "1995-05-19 UTC"
[25] "2002-10-23 UTC" "2010-10-26 UTC" "2006-09-08 UTC" "1998-06-30 UTC"
[29] "2015-05-20 UTC" "1999-02-05 UTC" "2003-05-18 UTC" "2001-10-19 UTC"
[33] "1978-12-08 UTC" "1998-06-12 UTC" "2001-07-04 UTC" "2005-07-17 UTC"
[37] "2006-04-19 UTC" "2013-06-21 UTC" "2007-03-08 UTC" "2012-05-24 UTC"
[41] "2008-06-12 UTC" "2001-11-17 UTC" "1970-01-25 UTC" "1999-07-22 UTC"
[45] "2000-05-14 UTC" "2006-07-25 UTC" "2008-06-30 UTC" "2011-09-15 UTC"
[49] "2001-05-09 UTC" "1994-11-11 UTC"

4 Data Extraction

# Adding a net_profit column(despite having 51 movies without a gross value)
summary(is.na(new_data$budget))

   Mode   FALSE 
logical    3348

summary(is.na(new_data$gross))

   Mode   FALSE    TRUE 
logical    3297      51

new_data <- new_data %>% mutate(net_profit = gross - budget)
summary(new_data$net_profit)

      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's 
-199821857   11458340   50844775  117730313  140930148 2686706026         51

# After exploring the summary, the values are too large to only display in USD
# so we'll be adding new columns containing the values in millions of USD
final_data <- new_data %>%
  mutate(budget_M = budget / 1e+6,
         gross_M = gross / 1e+6,
         net_profit_M = net_profit / 1e+6) %>%
  select(-c(budget, gross, net_profit))
summary(final_data$net_profit_M)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-199.82   11.46   50.84  117.73  140.93 2686.71      51

# Grouping by ROI classifications
final_data <- final_data %>% 
  mutate(roi_percentage = (net_profit_M / budget_M) * 100,
        roi_group = factor(case_when(
        roi_percentage < 0 ~ "Loss",
        roi_percentage > 0 & roi_percentage <= 100 ~ "Profit <= 100%",
        roi_percentage > 100 ~ "Profit > 100%"
    )))
summary(final_data$roi_group)

          Loss Profit <= 100%  Profit > 100%           NA's 
           495            682           2120             51

# Extracting the genres and counting their occurrences and their performances
genres_list <- final_data %>%
  # separate_longer_delim is used to dedicate new rows for each genre in the column
  separate_longer_delim(genres, delim = ",") %>%
  select(genres, averageRating, net_profit_M) %>%
  group_by(genres) %>%
  summarise(count = n(),
            avg_rating = mean(averageRating, na.rm = T),
            avg_profit = mean(net_profit_M, na.rm = T),
            total_profit = sum(net_profit_M, na.rm = T)) %>%
  arrange(desc(count))

genres_list

# Similarly, extracting the directors and calculating their occurrences
# and their performances
directors_list <- final_data %>%
  separate_longer_delim(directors, delim = ',') %>%
  select(directors, averageRating, net_profit_M) %>%
  group_by(directors)%>%
  summarise(movie_num = n(),
            avg_rating = mean(averageRating, na.rm = T),
            avg_profit = mean(net_profit_M, na.rm = T),
            total_profit = sum(net_profit_M, na.rm = T)) %>%
  arrange(desc(movie_num)) %>% 
  rename('name' = 'directors')
directors_list

5 Exploratory Analysis

5.1 Movies

5.1.1 Which year holds the most number of produced movies?

The increase in produced movies is supported by the rise of digital media and the boom of social media in an era where everyone was still fascinated by new technologies at the time, with a noticeable decrease in 2017-2019 mainly due to the maturation of the field prioritizing quality over quantity.

Only for it all to fall hard after being hit with COVID-19 and the restrictions at the time.

5.1.2 What about fan favorite movies?

Surprisingly the fan favorite movie wasn’t a huge success when it comes to net profit (hover over bars for more info), with 5 movies out of 10 from ’72 to ’94 thriving on a small budget compared to ones like The Dark Knight in 2008 with $185M! Which reflects the impressive growth and interest that movies gained along the way.

5.1.3 Most hated movies?

With 73,940 users rating it with the lowest possible rating, the movie Reis was also a total failure when it comes to net profit as it only generated a very low income of $538k compared to $8M as a budget that is approximately 15 times that amount!

Contrarily, some movies managed to generate an impressive net profit despite the very low rating; noticeably Justin Bieber: Never Say Never with $99M.

5.1.4 What are the most words used in movie titles?

After excluding the word the for its huge count number of 2082, we notice that many top used words in titles are among the most used ones in our day to day life.

5.2 Genres

5.2.1 How does the distribution of the top 10 most produced genres look like?

5.2.2 What are the top 5 most profitable genres?

5.2.3 What are the 10 top most rated genres with more than 200 movies?

5.3 Directors

5.3.1 Who are the most profitable directors?

5.3.2 Who are the directors who have directed the most movies alongside their average movie rating?

5.4 Finances

5.4.1 Which year is the most profitable?

The movie industry peaked in 2019 despite a falling number of produced movies, which enforces the saying that quantity doesn’t necessarily mean quality, only to be heavily hit in 2020 due to the COVID-19 outbreak although recovering steadily afterwards.

5.4.2 How does the ROI distribution look like?

Remarkably, the movie industry is a profitable one with only 15% of the movies resulting in a loss and a staggering 64% of the movies having a ROI greater than 100%.

5.4.3 Is there a relationship between budget and the ROI group?

Nothing unusual to note as we only remark that usually movies with a loss have lower budgets than those with a profit, and the ones with a profit greater than 100% tend to have more extreme budgets.

6 Model Development & Deployment

6.1 Preparing Data

model_data <- final_data %>%
  separate_wider_delim(genres, delim = ",", names_sep = "",
                       too_few = "align_start") %>%
  separate_wider_delim(directors, delim = ",", names_sep = "",
                       too_few = "align_start") %>%
  select(-c(id, primaryTitle, originalTitle, roi_percentage, roi_group,
            genres2, genres3, directors2, directors3)) %>%
  mutate(release_year = factor(year(release_date)),
         directors1 = factor(directors1),
         genres1 = factor(genres1),
         .keep = "unused")

Splitting the genres and directors columns helps dedicating a column for each value, although later on we’ll only be using the first ones that has no null values for simplicity sake and to avoid issues surrounding missing data in the rest of columns.

genres_count <- model_data %>%
  group_by(genres1) %>%
  summarise(count = n()) %>%
  filter(count < 30)

genres_count

In order to achieve a successful split for our data we should exclude genres with too few data, similarly for directors.

directors_count <- model_data %>%
  group_by(directors1) %>%
  summarise(count = n()) %>%
  filter(count < 6)

# Columns to delete
genres_del <- genres_count$genres1
directors_del <- directors_count$directors1

cat("Dataset rows before deleting: ", nrow(model_data), "\n")

Dataset rows before deleting:  3348

model_data <- model_data %>%
  filter(!(genres1 %in% genres_del) & !(directors1 %in% directors_del))

cat("Dataset rows after deleting: ", nrow(model_data), "\n")

Dataset rows after deleting:  1220

6.2 Splitting Data

set.seed(8)
split <- createDataPartition(model_data$directors1, p = .70, list = FALSE)
training <- model_data[ split,]
testing  <- model_data[-split,]

avg_rating_train <- training$averageRating
avg_rating_test <- testing$averageRating
training$averageRating = NULL
testing$averageRating = NULL

6.3 Preprocessing Data

preProc <- as.data.frame(training) %>% 
  preProcess(method = c("center", "scale", "knnImpute"))

# Columns affected by the preprocessing
colnames(preProc$data)

[1] "runtimeMinutes" "numVotes"       "budget_M"       "gross_M"       
[5] "net_profit_M"

training <- predict(preProc, training)
testing <- predict(preProc, testing)

Centering and scaling data is a must method before modelling to prevent outliers from influencing our predictions, plus we used knnImpute to better replace missing data, especially in the budget and gross columns, with values that are the most close to them.

dum1 <- dummyVars(" ~ .", data = training)
dum2 <- dummyVars(" ~ .", data = testing)

set.seed(8)
training_data <- data.frame(predict(dum1, newdata = training))
testing_data <- data.frame(predict(dum2, newdata = testing))

cat("Number of used predictors: ", ncol(training_data), "\n")

Number of used predictors:  1358

After preparing the genres and directors data, we should dummify the variables in order to include all possible values from the two columns.

Dummify the variables?

A dummy variable (also known as indicator variable or just dummy) is one that takes a binary value (0 or 1) to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.

For example, if we were studying the relationship between biological sex and income, we could use a dummy variable to represent the sex of each individual in the study. The variable could take on a value of 1 for males and 0 for females (or vice versa).

In machine learning this is known as one-hot encoding.

Source.

6.4 Models Training

6.4.1 Linear Model

trCtrl <- trainControl(method = "cv", number = 10)

set.seed(8)
lm_model <- train(
  x = training_data,
  y = avg_rating_train,
  method = "lm",
  trControl = trCtrl
)

We will be using cross validation in order to estimate the prediction error and accuracy.

6.4.2 glmnet Model

set.seed(8)

lambda_vector <- 10^seq(5, -5, length = 500)

glmnet_model <- train(
  x = training_data,
  y = avg_rating_train,
  method = "glmnet",
  trControl = trCtrl,
  tuneGrid = expand.grid(alpha = c(0, 0.25, 0.5, 0.75, 1), lambda = lambda_vector)
)

In order to ensure the best results, we create a large enough lambda vector from which the model will choose the best value.

Lambda is a parameter used for regularization which means penalizing large coefficients in order to prevent overfitting.

Similarly for the alpha variable that controls whether the model will be using a Lasso (alpha = 1), Ridge regression (alpha = 0) or a mix between them two.

The final values used for the model were alpha = 1 and lambda = 0.01111968.

6.4.3 Decision Tree

set.seed(8)
dt_model <- train(
  x = training_data,
  y = avg_rating_train,
  method = "rpart",
  trControl = trCtrl,
  tuneGrid = expand.grid(cp = 0)
)

The tuning parameter here is cp which determines a threshold under which the split of a node is not worth the complexity.

6.4.4 SVM

set.seed(8)
svm_model <- train(
  x = training_data,
  y = avg_rating_train,
  method = "svmRadial",
  trControl = trCtrl,
  tuneGrid = expand.grid(sigma = 0.09323158, C = 2)
)

The tuning parameter here is mainly C, also known as Cost, that determines the possible misclassifications. It essentially imposes a penalty to the model for making an error: the higher the value of C, the less likely it is that the SVM algorithm will misclassify a point.

6.4.5 KNN

set.seed(8)
knn_model <- train(
  x = training_data,
  y = avg_rating_train,
  method = "knn",
  trControl = trCtrl,
  tuneLength = 21
)

The final value used for the model was k = 11.

6.5 Assessing Performance

summary(resamples(list(lm = lm_model, glmnet = glmnet_model,
                       tree = dt_model, svm = svm_model, knn = knn_model)),
        metric = c("RMSE", "Rsquared"))


Call:
summary.resamples(object = resamples(list(lm = lm_model, glmnet =
 glmnet_model, tree = dt_model, svm = svm_model, knn = knn_model)), metric
 = c("RMSE", "Rsquared"))

Models: lm, glmnet, tree, svm, knn 
Number of resamples: 10 

RMSE 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
lm     0.5247915 0.5526271 0.5704807 0.5687484 0.5850785 0.6219603    0
glmnet 0.4798164 0.5153451 0.5381637 0.5347212 0.5624312 0.5763364    0
tree   0.5271630 0.5440330 0.5580149 0.5750778 0.6077979 0.6407053    0
svm    0.3975482 0.4302592 0.4575501 0.4585265 0.4846814 0.5134466    0
knn    0.4685230 0.5370744 0.5628652 0.5568190 0.5813201 0.6442948    0

Rsquared 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
lm     0.4837828 0.5322703 0.5555410 0.5602782 0.5848216 0.6387821    0
glmnet 0.5269968 0.5778200 0.6040906 0.5951670 0.6161814 0.6605277    0
tree   0.4586580 0.5226818 0.5518375 0.5457920 0.5785684 0.6005557    0
svm    0.6401051 0.6710110 0.6910436 0.6977531 0.7232659 0.7683910    0
knn    0.4945690 0.5391819 0.5614550 0.5723706 0.6061618 0.6918937    0

We could conclude that all the models perform good enough based on the RMSE value which ranges on average between 0.46 and 0.58.

The decision tree model performs with an average ability to explain variable relationships based on the R-squared metric, on the brighter side the SVM model performed relatively good with minimal RMSE and good R-squared values.

In conclusion, we will be using the SVM model.

(This choice is confirmed by testing the model with the testing data)

svm_predictions <- predict(svm_model, newdata = testing_data)

postResample(pred = svm_predictions, obs = avg_rating_test)

     RMSE  Rsquared       MAE 
0.4892209 0.6384008 0.3668819

6.6 Testing the model with new real data

We will use 3 different movies to further test the model :

With distant values in influential columns, these 3 should a good challenge for our model.

testing_movies <- data.frame(
  runtimeMinutes = c(115, 158, 127),
  genres1 = as.factor(c("Comedy", "Action", "Comedy")),
  numVotes = c(58875, 147376, 50616), # as of the 3rd of July, 2024
  directors1 = as.factor(c("Richard Linklater", "Ridley Scott", "Sean Anders")),
  budget_M = c(8.8, 200, 75),
  gross_M = c(3.168, 221.381, 100),
  net_profit_M = c(-5.632, 21.381, 25),
  release_year = as.factor(c(2023, 2023, 2022))
)

Preparing the data to predict the average rating :

testing_v2 <- predict(preProc, testing_movies)

set.seed(8)
dum <- dummyVars(" ~ .", data = model_data %>% select(-averageRating))
testing_v2 <- predict(dum, testing_v2)

Results :

data.frame(Name = c("Hit Man", "Napoleon", "Spirited"),
           `Predicted Rating` = predict(svm_model, testing_v2),
           `Actual Rating` = c(6.9, 6.4, 6.6))

We could conclude that this model can be used with future movies to predict their ratings considering the average RMSE of 0.45 and the almost accurate predictions demonstrated above.